Doña Ana County
Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
Jia, Mumin, Diaz-Rodriguez, Jairo
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (12 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- (2 more...)
Self-supervised and Multi-fidelity Learning for Extended Predictive Soil Spectroscopy
Sun, Luning, Safanelli, José L., Sanderman, Jonathan, Georgiou, Katerina, Brungard, Colby, Grover, Kanchan, Hopkins, Bryan G., Liu, Shusen, Bremer, Timo
We propose a self-supervised machine learning (SSML) framework for multi-fidelity learning and extended predictive soil spectroscopy based on latent space embeddings. A self-supervised representation was pretrained with the large MIR spectral library and the Variational Autoencoder algorithm to obtain a compressed latent space for generating spectral embeddings. At this stage, only unlabeled spectral data were used, allowing us to leverage the full spectral database and the availability of scan repeats for augmented training. We also leveraged and froze the trained MIR decoder for a spectrum conversion task by plugging it into a NIR encoder to learn the mapping between NIR and MIR spectra in an attempt to leverage the predictive capabilities contained in the large MIR library with a low cost portable NIR scanner. This was achieved by using a smaller subset of the KSSL library with paired NIR and MIR spectra. Downstream machine learning models were then trained to map between original spectra, predicted spectra, and latent space embeddings for nine soil properties. The performance of was evaluated independently of the KSSL training data using a gold-standard test set, along with regression goodness-of-fit metrics. Compared to baseline models, the proposed SSML and its embeddings yielded similar or better accuracy in all soil properties prediction tasks. Predictions derived from the spectrum conversion (NIR to MIR) task did not match the performance of the original MIR spectra but were similar or superior to predictive performance of NIR-only models, suggesting the unified spectral latent space can effectively leverage the larger and more diverse MIR dataset for prediction of soil properties not well represented in current NIR libraries.
- North America > United States > Washington > King County > Bellevue (0.04)
- North America > United States > Utah > Utah County > Provo (0.04)
- North America > United States > Oregon > Benton County > Corvallis (0.04)
- (4 more...)
- Food & Agriculture > Agriculture (1.00)
- Energy (0.93)
- Government > Regional Government > North America Government > United States Government (0.68)
Renovating Names in Open-Vocabulary Segmentation Benchmarks Haiwen Huang
Names are essential to both human cognition and vision-language models. Open-vocabulary models utilize class names as text prompts to generalize to categories unseen during training. However, the precision of these names is often overlooked in existing datasets. In this paper, we address this underexplored problem by presenting a framework for "renovating" names in open-vocabulary segmentation benchmarks (RENOV A TE). Our framework features a renaming model that enhances the quality of names for each visual segment. Through experiments, we demonstrate that our renovated names help train stronger open-vocabulary models with up to 15% relative improvement and significantly enhance training efficiency with improved data quality. We also show that our renovated names improve evaluation by better measuring misclassification and enabling fine-grained model analysis.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > Kansas > Butler County (0.04)
- (2 more...)
- Research Report > Experimental Study (0.93)
- Overview (0.93)
- Information Technology (0.46)
- Transportation > Ground > Road (0.46)
The CHASM-SWPC Dataset for Coronal Hole Detection & Analysis
Beck, Cutter, Smith, Evan, Katuwal, Khagendra, Kafle, Rudra, Whitehill, Jacob
Coronal holes (CHs) are low-activity, low-density solar coronal regions with open magnetic field lines (Cranmer 2009). In the extreme ultraviolet (EUV) spectrum, CHs appear as dark patches. Using daily hand-drawn maps from the Space Weather Prediction Center (SWPC), we developed a semi-automated pipeline to digitize the SWPC maps into binary segmentation masks. The resulting masks constitute the CHASM-SWPC dataset, a high-quality dataset to train and test automated CH detection models, which is released with this paper. We developed CHASM (Coronal Hole Annotation using Semi-automatic Methods), a software tool for semi-automatic annotation that enables users to rapidly and accurately annotate SWPC maps. The CHASM tool enabled us to annotate 1,111 CH masks, comprising the CHASM-SWPC-1111 dataset. We then trained multiple CHRONNOS (Coronal Hole RecOgnition Neural Network Over multi-Spectral-data) architecture (Jarolim et al. 2021) neural networks using the CHASM-SWPC dataset and compared their performance. Training the CHRONNOS neural network on these data achieved an accuracy of 0.9805, a True Skill Statistic (TSS) of 0.6807, and an intersection-over-union (IoU) of 0.5668, which is higher than the original pretrained CHRONNOS model Jarolim et al. (2021) achieved an accuracy of 0.9708, a TSS of 0.6749, and an IoU of 0.4805, when evaluated on the CHASM-SWPC-1111 test set.
- Europe > United Kingdom (0.14)
- North America > United States > North Dakota > Oliver County > Center (0.05)
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- (2 more...)
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
Li, Haoyuan, Shen, Zhengyuan, Jeoung, Sullam, Chen, Yueyan, Li, Jiayu, Zhu, Qi, Wang, Shuai, Ioannidis, Vassilis, Rangwala, Huzefa
As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL~performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
- North America > United States > North Carolina (0.04)
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (2 more...)
- North America > United States > North Dakota > Renville County (0.14)
- North America > Canada > Alberta > Census Division No. 13 > Athabasca County (0.14)
- North America > United States > Texas (0.14)
- (5 more...)
- Transportation > Ground > Road (0.93)
- Leisure & Entertainment > Sports (0.68)
Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation
Diaz-Rodriguez, Jairo, Jia, Mumin
Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibits strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We perform an LLM-based simulation that generates synthetic $m$-dependent text to validate the asymptotics. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines in standard text segmentation metrics. We demonstrate through a case study on Taylor Swift's tweets that KCPD not only provides strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (13 more...)
SDEC: Semantic Deep Embedded Clustering
Rahman, Mohammad Wali Ur, Nevarez, Ric, Mim, Lamia Tasnim, Hariri, Salim
The high dimensional and semantically complex nature of textual Big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings is used by SDEC to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC's advances in unsupervised text clustering.
- North America > United States > Arizona > Pima County > Tucson (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Ohio (0.04)
- (6 more...)
- Information Technology (0.67)
- Government (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning
Lin, Yijun, Chen, Theresa, Brungard, Colby, Sabine, Grunwald, Ives, Sue, Macander, Matt, Nawrocki, Timm, Chiang, Yao-Yi, Jelinski, Nic
Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region's ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at https://github.com/knowledge-computing/Peatland-permafrost.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
- North America > United States > Florida > Alachua County > Gainesville (0.14)
- North America > United States > Alaska > Fairbanks North Star Borough > Fairbanks (0.14)
- (10 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.93)
FUSE: Measure-Theoretic Compact Fuzzy Set Representation for Taxonomy Expansion
Xu, Fred, Jiang, Song, Huang, Zijie, Luo, Xiao, Zhang, Shichang, Chen, Adrian, Sun, Yizhou
Taxonomy Expansion, which models complex concepts and their relations, can be formulated as a set representation learning task. The generalization of set, fuzzy set, incorporates uncertainty and measures the information within a semantic concept, making it suitable for concept modeling. Existing works usually model sets as vectors or geometric objects such as boxes, which are not closed under set operations. In this work, we propose a sound and efficient formulation of set representation learning based on its volume approximation as a fuzzy set. The resulting embedding framework, Fuzzy Set Embedding (FUSE), satisfies all set operations and compactly approximates the underlying fuzzy set, hence preserving information while being efficient to learn, relying on minimum neural architecture. We empirically demonstrate the power of FUSE on the task of taxonomy expansion, where FUSE achieves remarkable improvements up to 23% compared with existing baselines. Our work marks the first attempt to understand and efficiently compute the embeddings of fuzzy sets.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (4 more...)